Abstract

An index of uniformity is developed as an alternative to the maximum-entropy
principle for selecting continuous, differentiable probability distributions $\mathcal{P}$ subject
to constraints $C$. The uniformity index developed in this paper is motivated by the
observation that among all differentiable probability distributions defined on a finite
interval $[a, b] \subset \mathbb{R}$, it is the uniform probability distribution that minimizes the path
length of the associated cumulative distribution function $F_{\mathcal{P}}$ on $[a,b]$. This intuition
is extended to situations where there are constraints on the allowable probability
distributions. In particular, constraints on the first and second raw moments of a
distribution are discussed in detail, including the analytical form of the solutions and
numerical studies of particular examples. The resulting "shortest path" distributions
are found to be decidedly more heavy-tailed than the associated maximum-entropy
distributions, suggesting that entropy and "CDF path length" measure two different
aspects of uncertainty for bounded distributions.
1 Introduction


There are numerous situations where we seek a probability model to represent uncertainty,
yet we only feel comfortable making a few assumptions about the nature of that uncertainty.
In such situations, the choice of probability distribution is underdetermined, requiring
additional criteria to pick out a single distribution from among those consistent with the
assumed facts. A common criterion is the maximum entropy principle (MEP), which
recommends selecting the distribution that maximizes the Shannon Entropy, which is defined
for discrete probability distributions ($P_d$) as:

$$H(X) := E_{P_d}\left[-\ln(P_d(X))\right] = -\sum_{x_i \in X} P_d(x_i) \ln(P_d(x_i))$$

For continuous distributions, one uses the Shannon Differential Entropy, which is an
extension of the Shannon Entropy to probability density functions $f(x)$:

$$h(X) := -\int_X f(x) \log(f(x))\,dx$$


The Shannon Entropy has several intuitively appealing properties when viewed as a
measure of information content: continuity, additivity, and maximality under a uniform
distribution. It also has close connections to data processing and statistical mechanics,
and it can be shown to satisfy several types of consistency criteria (Uffink (1996)). In his
seminal paper introducing information theory, Shannon (1948) cited three properties that
guided the development of the Shannon Entropy ($H$) for a discrete probability distribution
$P_n(x_i)$, defined on a finite set $\{x_i\}$, $i \in \{1, \ldots, n\}$:


1. $H$ should be continuous in the $P_n(x_i)$.

2. If all $P_n(x_i)$ are equal, then $H$ should be a monotonic increasing function of $n$: if $X$
has $n$ equally likely outcomes and $Y$ has $m > n$ equally likely outcomes, then $Y$ has
more uncertainty than $X$.

3. The value of $H$ should not depend on how we structure the process generating the
observable distribution. For example, if we have a random variable $X \in \{1, 2, 3, 4\}$,
we can model a measurement of $X$ as resulting from a single chance event with
four possible outcomes, or as a sequence of two dependent events, each with two
possibilities (conditional on the first outcome). If we are not allowed to look at the
intermediate process, then the entropy of $X$ should be the same for each model.
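The third property can be checked numerically. The following is a minimal sketch, assuming hypothetical outcome probabilities for $X \in \{1, 2, 3, 4\}$ and a first-stage split into the groups $\{1, 2\}$ and $\{3, 4\}$ (neither is taken from Shannon's paper); the two-stage entropy is the first-stage entropy plus the probability-weighted entropies of the second-stage choices.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical probabilities for X in {1, 2, 3, 4} (illustration only).
p = [0.1, 0.2, 0.3, 0.4]

# Model A: a single chance event with four outcomes.
h_single = entropy(p)

# Model B: first choose a group ({1,2} or {3,4}), then an outcome within it.
q = [p[0] + p[1], p[2] + p[3]]
within_12 = [p[0] / q[0], p[1] / q[0]]
within_34 = [p[2] / q[1], p[3] / q[1]]
h_two_stage = entropy(q) + q[0] * entropy(within_12) + q[1] * entropy(within_34)

print(h_single, h_two_stage)
```

Both models yield the same entropy up to floating-point error, which is the sense in which $H$ does not depend on how the generating process is structured.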


Due to the intuitively appealing properties of entropy noted by Shannon and others,
both Shannon Entropy and Differential Shannon Entropy have been recommended as tools
for developing uncertainty distributions (Jaynes (1957), Jaynes (1982)). Maximum-entropy
(ME) distributions are defined by their constraints, with several common examples given
below:


• $X \sim$ Uniform if $X \in [a, b]$, $-\infty < a < b < \infty$

• $X \sim$ Beta if $X \in [a, b]$, $-\infty < a < b < \infty$, $E[X] = \mu$, $V[X] = \sigma^2$

• $X \sim$ Exponential if $X \in [0, \infty)$, $E[X] = \mu$

• $X \sim$ Normal if $X \in (-\infty, \infty)$, $E[X] = \mu$, $V[X] = \sigma^2$


The fact that the normal distribution emerges as a ME distribution is quite a satisfying
result from a statistical and mathematical standpoint, as it concords with the ubiquity of
this distribution and provides support for its use as a "minimally informative" choice in
modeling.

However, while entropy as a measure of information content is well-established, its
applicability to modeling uncertainty is less clear cut. In particular, Uffink (1996) notes that
the Shannon Entropy is just one of many "entropy-like" measures. Other examples
are distributions maximizing the class of Rényi entropies, of which Shannon Entropy
is a special case, or distributions that are invariant under a pre-specified transformation.
What is interesting is that these forms of "entropy" also satisfy very plausible assumptions
about how a measure of information should behave, although the details of these
assumptions differ among the models (Uffink (1995)). Finally, even within the ME paradigm, the
form of the resulting entropy maximizer depends critically on the form of the constraints
(Uffink (1995)), implying that the MEP does not remove subjectivity as much as push it
down one level.

In addition, from a decision-theoretic viewpoint, it is not clear how a ME distribution
affects decisions. Does it make us more conservative or less conservative? Are we
emphasizing particular outcomes by maximizing the entropy? If so, why are those outcomes
justified in being given more weight, and in particular the specific weights prescribed by
the MEP? The above questions are not intended to be a pointed critique, and certainly
not any form of rebuttal, of the MEP or the resulting distributions. Instead, the intent is
merely to highlight the ambiguities inherent in attempting to model ignorance as opposed
to information. The ambiguity is exacerbated by the existence of alternative principles
of similar intuitiveness or level of justification (e.g., Jeffreys' invariance principle, Rényi
entropies).

The novel approach presented in this paper no doubt also possesses its own set of
ambiguities and unanswered questions. However, like other methods for modeling uncertainty,
it has several features that may appeal to various practitioners and/or be particularly
applicable in certain contexts.


2 Uniformity 

The approach taken in this paper seeks to maximize the uniformity of a probability
distribution, where "uniform" is meant in the sense of the "equal probability" of a random
variable for various outcomes. In particular, this paper focuses on maximizing the
uniformity of bounded, continuous-valued random variables with smooth distribution functions
(i.e., the cumulative distribution function (CDF) is differentiable over the entire domain).
Like other measures of uncertainty, uniformity is developed from a base of intuitively
plausible presuppositions and then extended to less familiar domains. In particular, any
uniformity measure for a continuous, bounded random variable should possess two very
basic properties:

1. A degenerate distribution (i.e., step function CDF) is the least uniform. 

2. The uniform distribution (not surprisingly) is the most uniform distribution. 

Given a measure of uniformity, we formulate the central "principle" for choosing a
distribution using this measure:

Principle of Maximum Uniformity: Of all continuous, finite-domain distribution
functions that are consistent with the available information, choose the one with the
highest measure of uniformity.

The rationale for this principle is that we want to choose a distribution "as uniform as
possible" from among the available distributions, so that we are not, a priori,
overemphasizing any regions of the sample space beyond what is necessary to be consistent with our
prior assumptions or available information.

To actually implement this principle, we need to quantify how close a given distribution
is to a uniform distribution. The following observation will prove instrumental to the
definition of such a quantity: the set of points $\{(x, F_U(x)) : x \in [a,b]\}$ generated by the
CDF of the uniform density function, $F_U$, on $[a, b]$ takes the shortest possible path from
$(x_0, y_0) = (a, 0)$ to $(x_1, y_1) = (b, 1)$. This is a defining feature of the uniform CDF (and of
straight lines in general). Hence, we use the path length of the CDF as the basis for our
quantity:

Definition: Uniformity Index Given a continuous random variable $X \in [a, b]$, with
probability density function $f_X$, the uniformity index of $f_X$ is defined as:

$$U(f_X) := \frac{\sqrt{1 + (b-a)^2}}{\int_a^b \sqrt{1 + f_X(x)^2}\,dx}$$

The uniformity index is simply the ratio of the path length of the uniform CDF to
the path length of the CDF of $X$ ($F_X$) on $[a, b]$. It satisfies the two basic properties of a
uniformity measure listed earlier: the longest path is the step function CDF associated with
the degenerate density function $\delta_a(x)$ and the shortest path is produced by the uniform
CDF on $[a,b]$. In general, $0.707 \approx \frac{1}{\sqrt{2}} \le \frac{\sqrt{1 + (b-a)^2}}{1 + (b-a)} \le U(f_X) \le 1$. Note that for a given
domain $[a, b]$, the uniformity index varies only as a function of the path length of $F_X$; hence,
maximizing the uniformity is equivalent to finding the Shortest Path Distribution (SPD)
that satisfies a given set of constraints $C$.
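As an illustration of the definition, the uniformity index can be approximated by numerical integration. The sketch below is not taken from the paper: the function names, the midpoint rule, and the triangular test density $f(x) = 2x$ on $[0, 1]$ are all assumptions made for the example.

```python
import math

def cdf_path_length(pdf, a, b, n=10000):
    """Midpoint-rule estimate of the arc length of the CDF on [a, b]."""
    h = (b - a) / n
    return h * sum(math.sqrt(1.0 + pdf(a + (i + 0.5) * h) ** 2) for i in range(n))

def uniformity_index(pdf, a, b, n=10000):
    """Ratio of the uniform CDF's path length to this CDF's path length."""
    return math.sqrt(1.0 + (b - a) ** 2) / cdf_path_length(pdf, a, b, n)

# Uniform density on [0, 1] and a hypothetical triangular density 2x on [0, 1].
u_uniform = uniformity_index(lambda x: 1.0, 0.0, 1.0)
u_triangular = uniformity_index(lambda x: 2.0 * x, 0.0, 1.0)
print(u_uniform, u_triangular)
```

The uniform density attains the maximum index of 1, while the triangular density falls between the lower bound $1/\sqrt{2}$ and 1.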
3 SPD under Raw Moment Constraints


Given a domain $[a, b]$ and set of constraints $C$, we are seeking to minimize the path length
$L[F]$ of the cumulative distribution function $F$ defined on $[a,b]$, subject to a set of
constraints $C$ (i.e., $F$ is part of a family of distributions $\mathcal{P}_C$ satisfying $C$):


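$$\operatorname*{argmin}_{F \in \mathcal{P}_C} L[F] = \operatorname*{argmin}_{F \in \mathcal{P}_C} \int_a^b \sqrt{1 + \left(\frac{dF}{dx}\right)^2}\,dx$$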


This type of problem is typical in the Calculus of Variations, where we are seeking to find
a univariate function $f(x)$ that minimizes a functional $I[f] = \int_a^b G(x, f, f')\,dx$. Candidates
for optimal functions are identified as solutions to the Euler-Lagrange equation:

$$\frac{\partial G}{\partial f} - \frac{d}{dx}\left(\frac{\partial G}{\partial f'}\right) = 0$$

In general, the Euler-Lagrange equation is merely a necessary condition for a minimum.
Second-order conditions are required to verify that $f$ is indeed a minimum and not a
maximum or merely a stationary point (think "plateaus" of some cubic polynomials). A
necessary condition for $f$ being a minimum of $I[f]$ is the Legendre Condition:

$$\frac{\partial^2 G}{\partial f'^2} \ge 0 \text{ for } x \in [a,b]$$

This paper will focus on solving the above problem for the path length functional $L[F]$
under constraints on the raw moments of $f = F'$ and a basic non-negativity constraint on
the probability density function $f$ (i.e., $f(x) \ge 0 \;\forall x \in [a,b]$). The functions associated with
the SPD constraints on all raw moments up to $m$ are given by the vector-valued function

$$C^m[f] := \left( \int_a^b x^i f(x)\,dx - \mu_i \right)_{i=0}^{m}, \quad \text{where } \mu = (\mu_0, \mu_1, \ldots, \mu_m) \text{ and } \mu_0 = 1 \qquad (1)$$


The constraints in this problem are referred to as isoperimetric constraints (alluding
to the motivating problem of finding a shape that maximizes the enclosed area given a
fixed perimeter length). As with classical constrained optimization, constraints can be
incorporated using Lagrange Multipliers, $\lambda_i$, which turn a constrained problem into an
unconstrained problem. In particular, the integrand for isoperimetric constraint $i$ is multiplied
by $\lambda_i$ and added to the integrand of $L[F]$ to form the Lagrangian, $\phi$, which will be used to
find the optimal function:

$$\phi[x, f, \lambda] = \sqrt{1 + f(x)^2} + \sum_{i=0}^{m} \lambda_i x^i f(x), \quad \text{where } f(x) = F'(x) \qquad (2)$$

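Applying the Euler-Lagrange equation to $\phi$, we can derive the general form of a solution:

$$\frac{\partial \phi}{\partial f} - \frac{d}{dx}\left(\frac{\partial \phi}{\partial f'}\right) = \frac{f}{\sqrt{1 + f^2}} + \sum_{i=0}^{m} \lambda_i x^i = 0$$

$$\Rightarrow \frac{d}{dx} F(x;\lambda) = f(x;\lambda) = \frac{\sum_{i=0}^{m} \lambda_i x^i}{\sqrt{1 - \left(\sum_{i=0}^{m} \lambda_i x^i\right)^2}} \qquad (3)$$

In the last equation, we took the positive root of the numerator, as the sign
is inconsequential given that the multipliers are undetermined. We will now check the Legendre
condition for $\phi$:

$$\frac{\partial^2}{\partial f'^2}\left(\sqrt{1 + f(x)^2} + \sum_{i=0}^{m} \lambda_i x^i f(x)\right) = 0$$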

This result indicates that our solution is not ruled out as a minimum. We can informally
check the actual solutions by comparing them to a trial solution. If the solution has a longer
path length than our trial solution, we know it is not a minimum. However, if it has a shorter
length, then we are likely looking at a minimum. This is the approach we will use on the
specific solutions derived in this paper.

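We now need to solve the system of integral constraints for the vector of Lagrange
multipliers $\lambda := (\lambda_0, \lambda_1, \ldots, \lambda_m)$, such that (3) is non-negative and satisfies the
isoperimetric constraints (1):

$$\text{Find } \lambda : \quad \frac{\sum_{i=0}^{m} \lambda_i x^i}{\sqrt{1 - \left(\sum_{i=0}^{m} \lambda_i x^i\right)^2}} \ge 0, \;\forall x \in [a, b]; \qquad C^m\left[\frac{\sum_{i=0}^{m} \lambda_i x^i}{\sqrt{1 - \left(\sum_{i=0}^{m} \lambda_i x^i\right)^2}}\right] = 0$$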
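To make the solution form (3) concrete, the following minimal sketch evaluates the candidate density for a given multiplier vector and rejects multipliers for which the density would be negative or non-real. The function names are hypothetical, and the check uses the simplest case ($m = 0$), where a constant polynomial must reproduce the uniform density $c = 1/(b-a)$, which requires $\lambda_0 = c/\sqrt{1 + c^2}$.

```python
import math

def poly(lam, x):
    """Evaluate S(x) = sum_i lam[i] * x**i."""
    return sum(l * x ** i for i, l in enumerate(lam))

def spd_density(lam, x):
    """Candidate density S / sqrt(1 - S^2) from equation (3); needs 0 <= S < 1."""
    s = poly(lam, x)
    if not 0.0 <= s < 1.0:
        raise ValueError("multipliers give a negative or non-real density at x")
    return s / math.sqrt(1.0 - s * s)

# m = 0 on [a, b]: solving lam_0 / sqrt(1 - lam_0^2) = c for the uniform height c.
a, b = 0.0, 0.1
c = 1.0 / (b - a)
lam0 = c / math.sqrt(1.0 + c * c)
print(spd_density([lam0], 0.05))
```

Note how close $\lambda_0$ is to 1 even for this simple case; small changes in the multipliers can push the polynomial outside the admissible range.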


4 Numerical Solution 

The form of the density function precludes a simple analytical solution for $\lambda$. Instead,
we numerically estimated $\lambda$ using a piecewise uniform approximation $f^*(x;\lambda)$ of the true
density function $f(x;\lambda)$. First, the domain of $f$ was broken into $n$ equal-length intervals
$\Delta_i = [x_{i-1}, x_i) = [a + (i-1)\delta_n, a + i\delta_n)$, $i = 1 \ldots n$, where $\delta_n = \frac{b-a}{n}$. Each interval has a
midpoint, $p_i = \frac{a + (i-1)\delta_n + a + i\delta_n}{2} = \frac{2a + (2i-1)\delta_n}{2}$, which is used for calculating the value of the
density over that interval. Using this approach, we were able to replace the isoperimetric
constraints (which are expressed as integrals) with Riemann sums over the partition $\{\Delta_i\}$:


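$$\begin{aligned}
C_k := \int_a^b x^k f(x)\,dx - \mu_k &\approx \int_a^b x^k f^*(x)\,dx - \mu_k = \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} f(p_i)\, x^k\,dx - \mu_k \\
&= \sum_{i=1}^{n} f(p_i;\lambda)\, \frac{x_i^{k+1} - x_{i-1}^{k+1}}{k+1} - \mu_k = \langle f(p;\lambda), \delta^k \rangle - \mu_k := M_k^n(\lambda)
\end{aligned} \qquad (4)$$

where $\delta_i^k := \int_{x_{i-1}}^{x_i} z^k\,dz$.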

Each summand in (4) represents the contribution of the interval
$\Delta_i$ to the $k$th raw moment of the piecewise uniform approximation to $f$. Note that this
reduces to the simple Riemann sum over $f$ with partition length $\delta_n$ in the case of the
"zeroth" moment ($k = 0$). For a given partition size $n$, we can estimate $\lambda$ using a least
squares solution to the constraints of the approximate problem:

$$\operatorname*{argmin}_{\lambda} \sum_{k=0}^{m} \left(M_k^n(\lambda)\right)^2 \qquad (5)$$
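The least-squares formulation can be sketched as follows. This is only an illustration under stated assumptions, not the paper's R code: the helper names are hypothetical, the residuals are computed per interval from the exact integral of $z^k$, and the sanity check uses the multipliers that reproduce the uniform density on $[0, 0.1]$ (mass 1, mean 0.05).

```python
import math

def interval_moment(x_lo, x_hi, k):
    """Integral of z**k over [x_lo, x_hi]."""
    return (x_hi ** (k + 1) - x_lo ** (k + 1)) / (k + 1)

def moment_residuals(lam, mu, a, b, n):
    """Residuals M_k^n(lam) for k = 0..m, with the density taken at midpoints."""
    h = (b - a) / n
    res = []
    for k, mu_k in enumerate(mu):
        total = 0.0
        for i in range(1, n + 1):
            x_lo, x_hi = a + (i - 1) * h, a + i * h
            s = sum(l * ((x_lo + x_hi) / 2) ** j for j, l in enumerate(lam))
            f_mid = s / math.sqrt(1.0 - s * s)  # assumes 0 <= s < 1
            total += f_mid * interval_moment(x_lo, x_hi, k)
        res.append(total - mu_k)
    return res

def least_squares_objective(lam, mu, a, b, n):
    """Sum of squared moment residuals, the quantity minimized over lam."""
    return sum(r * r for r in moment_residuals(lam, mu, a, b, n))

# Sanity check: multipliers that give the uniform density of height 10 on [0, 0.1].
lam_uniform = [10.0 / math.sqrt(101.0), 0.0]
obj = least_squares_objective(lam_uniform, [1.0, 0.05], 0.0, 0.1, 50)
print(obj)
```

An objective value near zero indicates that the multipliers satisfy the discretized moment constraints; a general-purpose optimizer would search over `lam` to drive this value down.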

Theoretically, this is all we need. However, from a numerical perspective, the form of
(3) as a function of $\lambda$ has two problems:

1. The numerator can become negative 

2. The denominator can become complex 

Therefore, we needed to define a feasible region for $\lambda$ to ensure we got physically sensible
answers. This was accomplished using the following constraint set on $\lambda$:

$$0 \le \sum_{i=0}^{m} \lambda_i p_j^i < 1, \quad j = 1 \ldots n \qquad (6)$$

Conveniently, each constraint in (6) defines two linear half-spaces in $\lambda$, which together define a convex polytope in $\mathbb{R}^{m+1}$.

We solved the above formulation using the R statistical programming language to run
the auglag nonlinear optimization routine in the alabama package, with Nelder-Mead as
the selected optimization method. The algorithm converged in some specific instances;
however, numerical tests indicated that directly using the parameterized form of the
density $f(p_i; \lambda)$ in the optimization model induced instability and poor convergence for some
values of the constraints. Therefore, we developed a second approach, which took a more
direct, high-dimensional optimization route.

In the second approach, we used the same set of partitions $\{\Delta_i\}$ and midpoints $p_i$ as
in the first approach, but we treated the density values at each midpoint (i.e., $f(p_i)$) as
decision variables. This change resulted in an $n$-variable minimization problem in $f(p_i) := f_i$,
where we are approximating the true density function $f(x; \lambda)$ by a piecewise constant
function $f^*(x)$ that takes value $f_i$ if $x \in \Delta_i$. If a smooth, continuous solution exists to the
true optimization problem, then the sequence of approximate solutions will also converge
pointwise: $\lim_{n \to \infty} f^*(x) = f(x) \;\forall x \in [a, b]$. The
new optimization model becomes:

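$$\operatorname*{argmin}_{\mathbf{f}} \sum_{i=1}^{n} \sqrt{1 + f_i^2} \quad \text{s.t.} \quad M_k^n(\mathbf{f}) = 0, \; 0 \le k \le m; \quad f_i \ge 0, \; i = 1 \ldots n \qquad (7)$$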

Here $M_k^n(\mathbf{f}) := \langle \mathbf{f}, \delta^k \rangle - \mu_k$ as in (4), but now the vector of density values contains the actual
decision variables, not values parameterized over $\lambda$. This more direct approach (using R
with auglag and the nlminb optimization routine) had better convergence properties for
the problems we investigated in this paper, with first- and second-order KKT conditions
being consistently satisfied. However, we felt that the analytical solution to the Euler-
Lagrange equation still has theoretical and conceptual value, despite not leading to the
best numerical approach.
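The ingredients of the direct formulation can be illustrated without an optimizer. The sketch below is an assumption-laden example rather than the paper's implementation: it evaluates the discretized path length (the objective in (7) scaled by the constant interval width, which does not change the minimizer) and the moment residuals for two hypothetical piecewise-constant densities on $[0, 0.1]$ that share the same mass and mean.

```python
import math

def discretized_problem(f, mu, a, b):
    """Return (path length, moment residuals) for piecewise-constant values f."""
    n = len(f)
    h = (b - a) / n
    edges = [a + i * h for i in range(n + 1)]
    # A piecewise-constant density gives a piecewise-linear CDF.
    length = h * sum(math.sqrt(1.0 + fi * fi) for fi in f)
    residuals = []
    for k, mu_k in enumerate(mu):
        moment = sum(
            fi * (edges[i + 1] ** (k + 1) - edges[i] ** (k + 1)) / (k + 1)
            for i, fi in enumerate(f)
        )
        residuals.append(moment - mu_k)
    return length, residuals

mu = [1.0, 0.05]  # hypothetical targets: total mass 1, mean 0.05
len_flat, res_flat = discretized_problem([10.0, 10.0, 10.0, 10.0], mu, 0.0, 0.1)
len_bowl, res_bowl = discretized_problem([15.0, 5.0, 5.0, 15.0], mu, 0.0, 0.1)
print(len_flat, len_bowl)
```

Both candidates are feasible for these targets, but the flat one has the shorter CDF path, so a constrained solver working on the $f_i$ directly would prefer it.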

The next section will show the typical shape of the resulting SPDs under low-order 
moment constraints. Each of these will be graphically compared to their equivalent ME 
distribution and the CDF lengths will be calculated for each. The paper will end with a 
discussion of the key insights from this investigation. 


5 Typical solutions for $m \in \{0, 1, 2\}$

5.1 Uniform (m = 0) 

This is the simplest case, where we only require that the probability equal 1. The resulting 
SPD is a uniform distribution, which is identical to the ME distribution. The uniform 
distribution is trivially the shortest path distribution under these conditions. 

5.2 Exponential (m = 1) 

We start to see some more structure when we also constrain the mean. In this case, we
are searching for an SPD on $[0, 0.1]$ with mean 0.04. The equivalent ME distribution is the
truncated exponential distribution $\mathrm{TEXP}(\lambda, b)$, where $\lambda = 12.3$, $b = 0.1$ (Figure 1). The
path length for the truncated exponential solution is 1.017, whereas the path length of the
SPD is 1.006, verifying that the SPD indeed has a shorter path
length.

While absolute path-length differences appear slight, this is primarily due to an issue
of scale. To correct for this, we calculated the ratio of each distribution's respective
difference from the CDF path length of a uniform distribution on $[0, 0.1]$ (i.e., this would be
$\sqrt{1 + (0.1)^2} \approx 1.005$), to get a "difference ratio" ($r$) of 11. The general formula is shown
below:


$$\text{Difference Ratio}(f_{ME}, f_{SP}) := \frac{L[f_{ME}] - \sqrt{1 + (b-a)^2}}{L[f_{SP}] - \sqrt{1 + (b-a)^2}} \qquad (8)$$


In this case, a difference ratio of 11 means that the SPD CDF path length is 11 times
closer to the straight-line distance (shortest possible path) than the CDF path length of
the associated truncated exponential distribution. Essentially, we are using the path length
of the uniform CDF over $[0, 0.1]$ as the point of reference, as 0 is not a feasible path length.
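The difference ratio in (8) is straightforward to compute. The following sketch uses a hypothetical function name and plugs in the rounded path lengths quoted above; because those inputs are rounded to three decimals while the excess over the baseline is only about 0.001, the result is sensitive to rounding and should be expected to land near, rather than exactly at, the reported ratio.

```python
import math

def difference_ratio(len_me, len_sp, a, b):
    """Excess CDF path length of the ME distribution relative to the SPD's."""
    baseline = math.sqrt(1.0 + (b - a) ** 2)  # straight-line (uniform CDF) length
    return (len_me - baseline) / (len_sp - baseline)

# Rounded path lengths quoted above for the mean-constrained case on [0, 0.1].
r = difference_ratio(1.017, 1.006, 0.0, 0.1)
print(r)
```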
Figure 1: SPD (solid line) and ME distribution (dotted line) on $[0, 0.1]$ for $m_1 = 0.04$

As an additional exploration, we also examined how the solutions change as the upper 
bound b of the random variable is increased. 



Figure 2: Sequence of SPDs for $m_1 = 0.04$ with increasing upper bounds (solid lines) and
ME distribution for unbounded case (dotted line).

From Figure 2, it appears that qualitatively the SPD becomes increasingly more peaked,
suggesting that $\lim_{b \to \infty} \mathrm{SPD}(m_1 = 0.04) = \delta(0)$. However, $E[\delta(0)] = 0 \ne 0.04$, which would
imply that there is no unbounded SPD (consistent with the fact that the sequence of path
lengths of the associated CDFs is unbounded).

5.3 Bell and Bowl (m = 2) 

This was the most complex case examined. We calculated two SPDs on $[-0.1, 0.1]$ with
mean 0 and second moments (here, the variance) equal to 0.001 and 0.005, respectively. The
equivalent ME distributions are Beta distributions on $[-0.1, 0.1]$ with parameters ($\alpha_1 =
\beta_1 = 4.5$) and ($\alpha_2 = \beta_2 = 0.5$), respectively. We called the first case the "Bell" and the
second case the "Bowl" Beta distributions to indicate their qualitative shapes. We found
that the SPD with the same first and second moments as the Bell Beta distribution was
markedly more peaked (Figure 3), while the Bowl Beta distribution and moment-matched
SPD were almost identical (Figure 4).



Figure 3: SPD for $m_1 = 0$, $m_2 = 0.001$ on $[-0.1, 0.1]$ (solid line) and Bell Beta (dotted line)
Figure 4: SPD for $m_1 = 0$, $m_2 = 0.005$ on $[-0.1, 0.1]$ (solid line) and Bowl Beta (dotted
line)

The path length was 1.06 for the Bell Beta CDF and 1.04 for the moment-matched
SPD CDF, yielding a difference ratio of 2. The Bowl Beta and moment-matched SPD
CDFs had the same path length of 1.02. Similar to the case with $m = 1$, we examined the
sequence of Bell-shaped SPDs with increasing support (Figure 5) and compared them to
their moment-matched normal distribution.
Figure 5: Sequence of SPDs ($m_1 = 0$, $m_2 = 0.001$) with symmetrically increasing support
(solid lines). Moment-matched normal distribution with unbounded support (dotted line).

As shown in Figure 5, it appears that $\lim_{|b| \to \infty} \mathrm{SPD}(m_1 = 0, m_2 = 0.001) = \delta(0)$, similar
to the fixed-mean SPD in the previous section. Again, this suggests that the SPDs must
be bounded.


6 Discussion 

We see that the SPDs do not, as a general rule, match the results given by the Maximum
Entropy Principle. Therefore, how can we interpret the difference between an SPD and
its associated ME distribution? In broad terms, it appears that SPDs emphasize "fragile
stability" while the ME distributions exhibit "stable instability". A process operating
according to a ME distribution will rapidly "fill" its outcome space, reaching its equilibrium
distribution quickly due to its high entropy. In contrast, a process operating according to
an SPD will appear to be relatively stable and well contained, only to suddenly produce
an extreme outlier (relative to the central 95% of the distribution).

The punctuated/catastrophic behavior of SPDs is reminiscent of the types of behavior
discussed in Nassim Taleb's The Black Swan, where Taleb argues that the variance between
outcomes is a poor measure of risk if the underlying distribution has high kurtosis (thus
high tail-risk). It is interesting that taking a very geometric/literal approach to uniformity
produces such different densities when compared to distributions based on maximizing an
abstract measure. The generality of this behavior is an open question: does it arise only
with path length, or does it arise whenever one is optimizing a metric as opposed to a
measure as a function of an input distribution? Regardless, the results of this analysis
point out that it is far from clear how "uncertainty" is best quantified, with different,
plausible approaches yielding very different results. By emphasizing extreme behavior,
SPDs provide a novel perspective on the relationship between risk, ignorance, probability,
and uncertainty.